Perfume Dataset Analysis (R)

This repository contains exploratory data analysis of the Perfume Dataset using R.

📌 Introduction

The global fragrance industry is both highly competitive and deeply shaped by cultural and consumer preferences. Beyond aesthetics, the market reflects evolving trends in gender identity, lifestyle choices, and purchasing behaviors. For brands, understanding these dynamics is critical in designing product portfolios, targeting marketing campaigns, and identifying opportunities for innovation.

In this report, we analyze a curated dataset of perfumes covering multiple dimensions, including brand, type, category, target audience, and longevity. Our objective is to uncover patterns that reveal how different factors interact and shape consumer preferences.

🛠️ Tools

  • RStudio / R Markdown
  • R packages: readr, dplyr, tidyr, stringr, janitor, ggplot2, ggrepel, scales, plotly, tibble, vcd

# Define a function that installs a package if it is missing, then loads it
install_if_missing <- function(pkg){
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, dependencies = TRUE, repos = "https://cloud.r-project.org/")
    library(pkg, character.only = TRUE)
  }
}

# Install (if needed) and load all required packages
pkgs <- c("readr","dplyr","tidyr","stringr","janitor","ggplot2","ggrepel","scales","plotly","tibble","vcd")
invisible(lapply(pkgs, install_if_missing))

📂 Project Structure

  • data/ → raw dataset and cleaned dataset
  • scripts/ → R scripts for data cleaning, analysis, visualization, modeling
  • notebooks/ → R Markdown for step-by-step analysis
  • outputs/ → figures, reports
  • docs/ → research notes, methodology

❓ Research Questions

1. Market share of men’s and women’s fragrances
2. Number of perfumes under each brand
3. Market share of each category and type
4. Gender preference by category and type
5. Does type or category influence longevity?

🚀 Next Steps

  • Perform data cleaning (handle missing values, duplicates, normalize categories)
  • Conduct exploratory data analysis (EDA)
  • Build visualizations with ggplot2
  • Explore clustering/ML models (e.g., k-means, regression)
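
The cleaning step in the list above can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual cleaning script; the `toy` data frame and its values are assumptions introduced only for the example.

```r
library(dplyr)

# Made-up rows for illustration only (not taken from the perfume dataset)
toy <- data.frame(
  brand    = c("dumont", "dumont", "armaf", NA),
  perfume  = c("nitro red", "nitro red", "club de nuit", "unknown"),
  category = c("fresh scent", "fresh scent", "", "woody spicy"),
  stringsAsFactors = FALSE
)

cleaned <- toy |>
  mutate(category = na_if(category, "")) |>    # treat empty strings as missing
  filter(!is.na(brand), !is.na(category)) |>   # drop rows missing key fields
  distinct()                                   # remove exact duplicate rows

print(cleaned)
```

Whether rows with missing fields should be dropped or kept under an explicit "Unknown" label depends on the question being asked; dropping is simply the shortest option to show.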

Data reading and cleaning

# Read in csv

perfume <- read.csv("Data/Perfumes_dataset.csv")

# Standardise column names and text values
perfume <- perfume |>
  janitor::clean_names() |>
  dplyr::mutate(
    brand           = stringr::str_squish(brand),
    perfume         = stringr::str_squish(perfume),
    type            = stringr::str_squish(stringr::str_to_lower(type)),        # e.g. "edp", "edt"
    category        = stringr::str_squish(stringr::str_to_title(category)),    # "Fresh Scent" etc.
    target_audience = stringr::str_squish(stringr::str_to_title(target_audience)), # "Male/Female/Unisex"
    longevity       = stringr::str_squish(stringr::str_to_title(longevity))    # "Strong/Medium/..."
  )
perfume[1:10,]
##     brand          perfume type         category target_audience longevity
## 1  dumont        nitro red  edp      Fresh Scent            Male    Strong
## 2  dumont nitro pour homme  edp      Fresh Scent            Male    Strong
## 3  dumont      nitro white  edp      Fresh Scent          Unisex    Strong
## 4  dumont       nitro blue  edp      Fresh Scent          Unisex    Strong
## 5  dumont      nitro green  edp      Fresh Scent          Unisex    Strong
## 6  dumont   nitro platinum  edp     Mass Pleaser            Male    Strong
## 7  dumont    nitro intense  edp      Woody Spicy            Male    Strong
## 8  dumont      nitro black  edp      Woody Spicy            Male    Strong
## 9  dumont     celerio oros  edp Oriental Vanilla          Unisex    Medium
## 10 dumont     celerio epic  edp   Woody Aromatic            Male    Medium
glimpse(perfume)
## Rows: 1,004
## Columns: 6
## $ brand           <chr> "dumont", "dumont", "dumont", "dumont", "dumont", "dum…
## $ perfume         <chr> "nitro red", "nitro pour homme", "nitro white", "nitro…
## $ type            <chr> "edp", "edp", "edp", "edp", "edp", "edp", "edp", "edp"…
## $ category        <chr> "Fresh Scent", "Fresh Scent", "Fresh Scent", "Fresh Sc…
## $ target_audience <chr> "Male", "Male", "Unisex", "Unisex", "Unisex", "Male", …
## $ longevity       <chr> "Strong", "Strong", "Strong", "Strong", "Strong", "Str…
summary(perfume)
##     brand             perfume              type             category        
##  Length:1004        Length:1004        Length:1004        Length:1004       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  target_audience     longevity        
##  Length:1004        Length:1004       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
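
To make the effect of the standardisation step concrete, the sketch below applies the same two stringr calls to a few made-up labels. The input strings are hypothetical and chosen only to show irregular spacing and casing.

```r
library(stringr)

# Hypothetical messy labels (irregular spacing and casing)
raw_labels <- c("  fresh   scent ", "WOODY spicy", "oriental  Vanilla")

# Title-case first, then collapse repeated whitespace and trim the ends
clean_labels <- str_squish(str_to_title(raw_labels))
print(clean_labels)
```

Note that this only harmonises whitespace and casing; spelling variants of the same label remain distinct values.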

Columns

brand – The company or label that produces the perfume (e.g., Dumont).

perfume – The name of the fragrance (e.g., Nitro Red).

type – Concentration or formulation of the perfume (e.g., EDP – Eau de Parfum).

category – Classification of the fragrance based on scent family or style (e.g., Fresh Scent, Woody Spicy, Oriental Vanilla).

target_audience – The intended wearer of the perfume (e.g., Male, Female, Unisex).

longevity – Expected performance in terms of duration on the skin (e.g., Strong, Medium).

Example Entries

Nitro Red (Dumont, EDP) – A fresh scent designed for men with strong longevity.

Celerio Oros (Dumont, EDP) – An oriental vanilla fragrance suitable for unisex wearers with medium longevity.

Nitro Black (Dumont, EDP) – A woody spicy perfume for men with strong performance.

# Find the number of unique values in each column
sapply(perfume, function(x) length(unique(x)))
##           brand         perfume            type        category target_audience 
##              55             940              11             157               7 
##       longevity 
##              13
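
The 157 category values include near-duplicates: the Q4 output lists both "Amber Fougere" and "Amber Fougère" as separate categories. One possible way to merge such variants, shown here only as a sketch, is to fold accents to ASCII with `stringi::stri_trans_general()` (stringi is installed as a dependency of stringr). The three labels below are a small hand-picked sample, not the full column.

```r
library(stringi)

# Two spellings of the same category plus one unrelated label
cats <- c("Amber Fougere", "Amber Foug\u00e8re", "Woody Spicy")

# Transliterate accented Latin characters to their ASCII equivalents
folded <- stri_trans_general(cats, "Latin-ASCII")

length(unique(cats))    # distinct labels before folding
length(unique(folded))  # distinct labels after folding
```

If accents should be preserved for display, the folded value can be used as a grouping key while the original label is kept in a separate column.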

Q1. Market share of men’s and women’s fragrances

# Count rows per group and add each group's share of the total
count_share <- function(df, group_vars) {
  df |>
    dplyr::count(dplyr::across({{ group_vars }}), name = "n") |>
    dplyr::mutate(share = n / sum(n)) |>
    dplyr::arrange(dplyr::desc(n))
}
market_share_all <- count_share(perfume, target_audience)

market_share_all_fixed <- market_share_all %>%
  mutate(target_audience = recode(target_audience, "Men" = "Male", "Women" = "Female")) %>%
  group_by(target_audience) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  mutate(share = n / sum(n)) %>%
  arrange(desc(n))

print(market_share_all_fixed)
## # A tibble: 5 × 3
##   target_audience     n    share
##   <chr>           <int>    <dbl>
## 1 Unisex            375 0.374   
## 2 Female            331 0.330   
## 3 Male              296 0.295   
## 4 Gourmand            1 0.000996
## 5 Target Audience     1 0.000996
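# Keep the three valid audience groups ("Men"/"Women" were merged into "Male"/"Female" above);
# the single-row "Gourmand" and "Target Audience" entries are left out of the chart
audience_share <- market_share_all_fixed %>%
  filter(target_audience %in% c("Male", "Female", "Unisex")) %>%
  select(target_audience, share)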

# Ensure factor levels in the right order
audience_share <- audience_share %>%
  mutate(target_audience = factor(target_audience, levels = c("Male","Female","Unisex")))

# Pie chart
ggplot(audience_share, aes(x = "", y = share, fill = target_audience)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = percent(share, accuracy = 0.1)),
            position = position_stack(vjust = 0.5), size = 5) +
  scale_fill_manual(
    values = c("Male" = "tomato", "Female" = "seagreen3", "Unisex" = "steelblue")
  ) +
  labs(title = "Market Share of Fragrances by Target Audience",
       fill = "Target Audience") +
  theme_void(base_size = 14)

A large variance in the audience shares within a category shows that the category is split unevenly across audiences, and the range (largest share minus smallest share) shows how wide that gap is. Based on these two criteria, we can select the k most differentiated categories. With the help of a heatmap, we can then see clearly which audience dominates each of these categories.
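
The scoring idea (variance plus range of the audience shares, weighted equally as in `select_categories()`) can be illustrated on two made-up share vectors. The numbers below are hypothetical and serve only to show that an evenly split category scores zero while a skewed one scores high.

```r
# Hypothetical within-category audience shares (each row sums to 1)
share_mat <- rbind(
  balanced = c(Male = 1/3,  Female = 1/3,  Unisex = 1/3),
  skewed   = c(Male = 0.05, Female = 0.90, Unisex = 0.05)
)

var_share   <- apply(share_mat, 1, var)                           # spread of shares
range_share <- apply(share_mat, 1, function(x) diff(range(x)))    # max minus min
score       <- 0.5 * var_share + 0.5 * range_share                # equal weights

print(round(score, 3))
```

Because variance and range live on different scales, the range contributes most of the score here; the equal weights are a simple heuristic rather than a calibrated measure.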

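# Re-apply the label clean-up and keep only the three main audience groups.
# Caution: "Men"/"Women" are not recoded to "Male"/"Female" in `perfume` itself,
# so the filter below drops those rows along with the stray "Gourmand" and "Target Audience" rows.
perfume <- perfume %>%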
  janitor::clean_names() %>%
  mutate(
    category        = str_squish(str_to_title(category)),
    target_audience = str_squish(str_to_title(target_audience))
  ) %>%
  filter(target_audience %in% c("Male","Female","Unisex"),
         !is.na(category), category != "")
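
The order of recoding and filtering matters for how many rows survive. The sketch below uses a made-up six-row column to compare filtering alone with recoding first (the same `recode()` mapping used for the Q1 summary table). It is an illustration under those assumptions, not a change to the pipeline above.

```r
library(dplyr)

# Hypothetical audience labels, including the variants seen in the raw data
toy <- data.frame(
  target_audience = c("Male", "Men", "Female", "Women", "Unisex", "Gourmand"),
  stringsAsFactors = FALSE
)
valid <- c("Male", "Female", "Unisex")

# Filtering alone discards the "Men" and "Women" rows
filtered_only <- toy |>
  filter(target_audience %in% valid)

# Recoding first keeps them under the harmonised labels
recoded_first <- toy |>
  mutate(target_audience = recode(target_audience,
                                  "Men" = "Male", "Women" = "Female")) |>
  filter(target_audience %in% valid)

nrow(filtered_only)
nrow(recoded_first)
```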

# ====== Counts: audience × category (count + share within category) ======
heat_df <- perfume %>%
  count(target_audience, category, name = "n") %>%
  group_by(category) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

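# ====== Helper: select "representative" categories ======
# method:
# - "top_total": pick the k categories with the highest total count
# - "most_divergent": pick the k categories whose audience shares differ most (variance / range)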
select_categories <- function(df, k = 20, method = c("top_total","most_divergent")){
  method <- match.arg(method)
  wide <- df %>%
    select(category, target_audience, n, share) %>%
    pivot_wider(names_from = target_audience, values_from = c(n, share), values_fill = 0)
  
  if (method == "top_total"){
    picked <- wide %>%
      mutate(total_n = rowSums(across(starts_with("n_")))) %>%
      arrange(desc(total_n)) %>%
      slice_head(n = min(k, nrow(.))) %>%
      pull(category)
  } else {
    # Score "divergence" as the variance plus the range of the audience shares
    picked <- wide %>%
      transmute(
        category,
        var_share  = apply(across(starts_with("share_")), 1, var),
        range_share = apply(across(starts_with("share_")), 1, function(x) diff(range(x)))
      ) %>%
      mutate(score = 0.5*var_share + 0.5*range_share) %>%
      arrange(desc(score)) %>%
      slice_head(n = min(k, nrow(.))) %>%
      pull(category)
  }
  picked
}

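# ====== Helper: cluster/order the columns (gives the heatmap more structure) ======
# Hierarchically cluster the category × audience matrix and return the category order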
cluster_category_order <- function(df, use = c("n","share")){
  use <- match.arg(use)
  mat <- df %>%
    select(category, target_audience, !!sym(use)) %>%
    pivot_wider(names_from = target_audience, values_from = !!sym(use), values_fill = 0) %>%
    column_to_rownames("category") %>%
    as.matrix()
  # Correlation or Euclidean distance both work; Euclidean is usually sufficient for shares
  d <- dist(scale(mat, center = TRUE, scale = TRUE), method = "euclidean")
  hc <- hclust(d, method = "ward.D2")
  rownames(mat)[hc$order]
}

# ====== Main function: draw the interactive heatmap (plotly) ======
# metric: "count" or "share"
# pick_method: "top_total" or "most_divergent"
# k: number of categories to display
# cluster: whether to order categories by clustering
plot_heatmap_q1 <- function(df = heat_df,
                            metric = c("count","share"),
                            pick_method = c("top_total","most_divergent"),
                            k = 20,
                            cluster = TRUE){
  metric <- match.arg(metric)
  pick_method <- match.arg(pick_method)
  metric_col <- if (metric == "count") "n" else "share"
  
  cats <- select_categories(df, k = k, method = pick_method)
  df_sub <- df %>% filter(category %in% cats)
  
  # Column order
  if (cluster){
    cat_order <- cluster_category_order(df_sub, use = if(metric=="count") "n" else "share")
  } else {
    # Without clustering: sort by total in descending order
    cat_order <- df_sub %>%
      group_by(category) %>%
      summarise(tot = sum(.data[[metric_col]]), .groups = "drop") %>%
      arrange(desc(tot)) %>%
      pull(category)
  }
  
  # Assemble the matrix and the hover text
  mat <- df_sub %>%
    mutate(category = factor(category, levels = cat_order),
           target_audience = factor(target_audience, levels = c("Male","Female","Unisex"))) %>%
    arrange(target_audience, category) %>%
    select(category, target_audience, !!sym(metric_col)) %>%
    pivot_wider(names_from = category, values_from = !!sym(metric_col), values_fill = 0) %>%
    column_to_rownames("target_audience") %>%
    as.matrix()
  
  # Hover info: show both count and share
  hover_df <- df_sub %>%
    mutate(category = factor(category, levels = cat_order),
           target_audience = factor(target_audience, levels = c("Male","Female","Unisex"))) %>%
    arrange(target_audience, category) %>%
    select(category, target_audience, n, share) %>%
    pivot_wider(names_from = category, values_from = c(n, share), values_fill = 0)
  
  # Build hovertext with the same dimensions as mat
  audience_levels <- rownames(mat)
  hovertext <- matrix("", nrow = nrow(mat), ncol = ncol(mat))
  for (i in seq_along(audience_levels)){
    aud <- audience_levels[i]
    n_vec     <- as.numeric(hover_df %>% filter(target_audience == aud) %>% select(starts_with("n_")) )
    share_vec <- as.numeric(hover_df %>% filter(target_audience == aud) %>% select(starts_with("share_")) )
    hovertext[i, ] <- paste0(
      "Category: ", colnames(mat), "<br>",
      "Audience: ", aud, "<br>",
      "Count: ", n_vec, "<br>",
      "Share-in-Category: ", scales::percent(share_vec, accuracy = 0.1)
    )
  }
  
  # Colour bar title
  colorbar_title <- if (metric == "count") "Count" else "Share"
  
  # Draw the plot
  plotly::plot_ly(
    x = colnames(mat), y = rownames(mat),
    z = mat,
    type = "heatmap",
    colors = "Blues",            # continuous colour scale
    hoverinfo = "text",
    text = hovertext,
    showscale = TRUE,
    colorbar = list(title = colorbar_title)
  ) %>%
    layout(
      title = paste0("Gender × Category Heatmap (metric: ", colorbar_title, 
                     ", pick: ", pick_method, ", k=", k, if (cluster) ", clustered" else "" ,")"),
      xaxis = list(title = "Category", tickangle = 40, automargin = TRUE),
      yaxis = list(title = "Target Audience", automargin = TRUE)
    )
}

# ====== Usage examples ======
# 1) Plot counts, pick the 20 most divergent categories, and order them by clustering
p1 <- plot_heatmap_q1(metric = "count", pick_method = "most_divergent", k = 20, cluster = TRUE)
p1
# 2) Plot within-category shares, pick the 15 categories with the highest totals, no clustering
p2 <- plot_heatmap_q1(metric = "share", pick_method = "top_total", k = 15, cluster = FALSE)
p2

Conclusion

Our analysis of target audiences and fragrance categories reveals both market-level distribution and gender-specific preferences.

Overall market distribution (Pie chart)

Unisex fragrances represent the largest share (37.4%), which suggests a shift toward inclusivity and flexibility in fragrance design.

Female fragrances account for 33.0%, showing sustained importance but no longer dominating the market.

Male fragrances make up 29.5%, the smallest segment, suggesting relative underrepresentation in product offerings.
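
The three percentages above follow directly from the counts in the printed Q1 table (1,004 rows in total, including the two stray single-row labels). A short check, using only those printed counts:

```r
# Counts copied from the printed Q1 summary table
counts <- c(Unisex = 375, Female = 331, Male = 296,
            Gourmand = 1, "Target Audience" = 1)

# Share of each label in the full table
shares <- counts / sum(counts)

# Percentages for the three audience groups, rounded to one decimal
pct <- round(100 * shares[c("Unisex", "Female", "Male")], 1)
print(pct)
```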

Category differences in raw counts (Heatmap: Count, most divergent)

Certain floral categories (e.g., Floral Fruity, Floral Rose) show a heavy skew toward the female segment.

For males, representation in these categories is very low, confirming that floral profiles remain strongly gendered toward women.

This reflects the traditional cultural alignment between floral notes and femininity.

Category shares by audience (Heatmap: Share, top categories)

Male-oriented strength: Woody Spicy, Woody Aromatic, and Oriental Spicy categories are disproportionately male, highlighting men’s preference for deeper and spicier profiles.

Female-oriented strength: Floral Fruity and Oriental Floral are much more represented among women.

Unisex-oriented strength: Certain categories such as Woody Floral, Amber Woody, and Amber Floral (77.5% unisex in the Q4 table) show higher shares in unisex products, suggesting these categories act as “bridges” between male and female markets.

📊 Key Insights

Unisex fragrances now hold the largest single share (37.4%), ahead of the female (33.0%) and male (29.5%) segments, although gendered products together still make up the majority of the dataset.

Category preferences remain gendered, however:

Floral-based scents → largely female.

Woody/Spicy/Oriental → largely male.

Some blends (Woody Floral, Amber Woody) → well-suited to unisex positioning.

Business implications:

Brands should continue to invest in unisex lines, particularly around woody/floral blends that already appeal across genders.

For men, reinforcing woody and spicy scents aligns with current consumer demand.

For women, floral and fruity combinations remain central, but opportunities exist to innovate with more balanced profiles.

Q2. Number of perfumes under each brand

brand_counts <- perfume |>
  dplyr::count(brand, name = "n") |>
  dplyr::arrange(dplyr::desc(n))
print(head(brand_counts, 20)) # top 20 brands
##                 brand  n
## 1  Jean Paul Gaultier 94
## 2        paris corner 76
## 3               armaf 70
## 4     fragrance world 42
## 5         Al Haramain 37
## 6              Azzaro 35
## 7             Lattafa 33
## 8               Afnan 30
## 9                Dior 29
## 10    Maison Alhambra 25
## 11              Creed 24
## 12  Victoria's Secret 24
## 13      Louis Vuitton 23
## 14             Hermès 22
## 15              Ajmal 21
## 16              Prada 20
## 17   Carolina Herrera 19
## 18    Dolce & Gabbana 19
## 19            xerjoff 18
## 20   Parfums de Marly 17
# To see all brands, run print(brand_counts)
top10 <- brand_counts[1:10,]
p_bar <- ggplot(top10, aes(x = brand, y = n, fill = brand)) +
  geom_col(width = 0.7, color = "white") +
  geom_text(aes(label = n), hjust = 1.02, size = 3.8) +     # count label inside the bar, at its right end
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, .05))) +
  scale_fill_brewer(palette = "Paired") +
  labs(
    title = "Top 10 Brands by Number of Perfumes",
    x = "Brand", y = "Count", fill = "Brand"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none",
        panel.grid.major.y = element_blank())
print(p_bar)

# ------------- B) Pie Chart (Top 10) -------------
# ---- Top 10 brand data ----
top10 <- perfume %>%
  count(brand, name = "n") %>%
  arrange(desc(n)) %>%
  slice_max(n, n = 10) %>%
  mutate(share = n / sum(n),               # share within the top 10 only
         label = percent(share, accuracy = 0.1))

# ---- Draw a simple pie chart ----
ggplot(top10, aes(x = 1, y = share, fill = brand)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  # position_stack() centres each label in its own slice, whatever the stacking order
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Market Share of Top 10 Perfume Brands",
    fill = "Brand"
  ) +
  theme_void(base_size = 14) +
  theme(legend.position = "right")

Our analysis of the top 10 perfume brands highlights the competitive landscape in terms of product portfolio size:

Jean Paul Gaultier dominates

With 94 perfumes, Jean Paul Gaultier leads the market by a significant margin.

This represents the largest single brand share (20.0%) among the top 10, showing its strong focus on product variety and innovation.

Strong challengers in the mid-range

Paris Corner (76 perfumes, 16.1%) and Armaf (70 perfumes, 14.9%) follow closely, together holding nearly one-third of the market within the top 10 brands.

These brands have built large and diverse portfolios, indicating aggressive strategies in product expansion.

Other notable players

Fragrance World (42, 8.9%), Al Haramain (37, 7.9%), and Lattafa (33, 7.0%) form a competitive mid-tier.

Traditional luxury brands like Azzaro (35, 7.4%) and Dior (29, 6.2%) maintain a stable presence but with smaller portfolios relative to the leaders.

📊 Key Insights

Jean Paul Gaultier, Paris Corner, and Armaf collectively account for about half (51.0%) of the top 10 market share, making them the clear leaders in terms of product variety.

Luxury houses (Azzaro, Dior) have comparatively smaller portfolios, but they may rely more on brand equity and premium positioning than sheer volume.

Emerging and Middle Eastern brands (Lattafa, Al Haramain, Fragrance World) are significant players, reflecting the globalization of perfume markets and the growing importance of niche/affordable luxury brands.

Q3. Market share of each category and type

category_share <- count_share(perfume, category)
top20_category <- head(category_share, 20)
type_share <- count_share(perfume, type)

ggplot(top20_category, aes(x = reorder(category, share), y = share, fill = category)) +
  geom_col(width = 0.7, color = "white") +
  coord_flip() +
  geom_text(aes(label = percent(share, accuracy = 0.1)),
            hjust = -0.1, size = 3.5) +
  scale_y_continuous(labels = percent, expand = expansion(mult = c(0, 0.1))) +
  scale_fill_viridis_d() +   # 20 categories exceed the 12-colour limit of the brewer "Paired" palette
  labs(
    title = "Market Share by Category (Top 20)",
    x = "Category",
    y = "Share"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

# ---- Barplot 2: Type ----
ggplot(type_share, aes(x = reorder(type, share), y = share, fill = type)) +
  geom_col(width = 0.7, color = "white") +
  coord_flip() +
  geom_text(aes(label = percent(share, accuracy = 0.1)),
            hjust = -0.1, size = 3.5) +
  scale_y_continuous(labels = percent, expand = expansion(mult = c(0, 0.1))) +
  scale_fill_viridis_d() +   # there are more type levels than the 8 colours in the brewer "Set2" palette
  labs(
    title = "Market Share by Type",
    x = "Type",
    y = "Share"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Q3: Market Share by Category and Type

Category insights (Top 20)

Woody Spicy (14.4%) and Floriental (13.0%) are the clear leaders, together covering over one-quarter of the market.

Oriental Floral (8.7%) is also highly represented, reinforcing the dominance of complex floral–oriental blends.

Secondary but still notable categories include Woody Aromatic (4.5%), Amber Floral (4.5%), and Floral Fruity (3.8%).

The distribution shows that the market favors woody, spicy, and floral–oriental blends, while lighter categories such as Fresh Aquatic (1.2%) and Fresh Spicy (0.9%) remain niche.

Type insights

Eau de Parfum (EDP) overwhelmingly dominates with 76.9% of the market. This confirms that brands and consumers heavily prefer this concentration, balancing strength and wearability.

Eau de Toilette (EDT) follows at 14.6%, serving as the lighter alternative but still significant.

Other types such as Parfum (4.1%), Extrait de Parfum (1.9%), and Cologne (1.2%) represent much smaller shares, while formats like oil, concentrate, attar, alcohol-free remain marginal (<1%).

📊 Key Insights

The fragrance market is highly concentrated in a few dominant categories: Woody, Spicy, and Floral–Oriental blends define the majority of offerings.

EDP is the industry standard, dwarfing all other perfume types — showing that consumers value strong, long-lasting scents but not as extreme as pure Parfum or Extrait.

Business implications:

Brands competing in woody–spicy or floral–oriental spaces must differentiate strongly to stand out in a crowded field.

Opportunities exist in niche/light categories (e.g., fresh scents, aquatic types) for targeting younger or seasonal consumers.

While EDP remains king, differentiation in EDT or niche formats could appeal to consumers seeking variety beyond the mainstream.

Q4. Gender preference of category and type

gender_in_category <- perfume |>
  dplyr::count(category, target_audience, name = "n") |>
  dplyr::group_by(category) |>
  dplyr::mutate(share_within_category = n / sum(n)) |>
  dplyr::arrange(category, dplyr::desc(share_within_category)) |>
  dplyr::ungroup()
print(head(gender_in_category, 30))
## # A tibble: 30 × 4
##    category       target_audience     n share_within_category
##    <chr>          <chr>           <int>                 <dbl>
##  1 Amber          Female              1                 0.5  
##  2 Amber          Unisex              1                 0.5  
##  3 Amber Floral   Unisex             31                 0.775
##  4 Amber Floral   Female              9                 0.225
##  5 Amber Fougere  Male                1                 1    
##  6 Amber Fougère  Unisex              1                 1    
##  7 Amber Leather  Unisex              1                 1    
##  8 Amber Musk     Unisex              2                 1    
##  9 Amber Oriental Unisex              2                 1    
## 10 Amber Oud      Unisex              2                 1    
## # ℹ 20 more rows
gender_in_type <- perfume |>
  dplyr::count(type, target_audience, name = "n") |>
  dplyr::group_by(type) |>
  dplyr::mutate(share_within_type = n / sum(n)) |>
  dplyr::arrange(type, dplyr::desc(share_within_type)) |>
  dplyr::ungroup()
print(gender_in_type)
## # A tibble: 18 × 4
##    type              target_audience     n share_within_type
##    <chr>             <chr>           <int>             <dbl>
##  1 alcohol-free      Unisex              1            1     
##  2 attar             Unisex              1            1     
##  3 cologne           Female              6            0.545 
##  4 cologne           Unisex              5            0.455 
##  5 concentrate       Unisex              2            1     
##  6 edp               Unisex            311            0.452 
##  7 edp               Female            224            0.326 
##  8 edp               Male              153            0.222 
##  9 edt               Female             69            0.527 
## 10 edt               Male               35            0.267 
## 11 edt               Unisex             27            0.206 
## 12 extrait           Unisex              4            1     
## 13 extrait de parfum Unisex             16            0.941 
## 14 extrait de parfum Female              1            0.0588
## 15 oil               Unisex              3            1     
## 16 parfum            Male               20            0.541 
## 17 parfum            Female             12            0.324 
## 18 parfum            Unisex              5            0.135
# ====== A) Gender × Category ======
tab_cat <- table(perfume$target_audience, perfume$category)

# Chi-squared test
chi_cat <- chisq.test(tab_cat)
print(chi_cat)
## 
##  Pearson's Chi-squared test
## 
## data:  tab_cat
## X-squared = 1137.3, df = 288, p-value < 2.2e-16
# Cramer's V
cramer_v_cat <- sqrt(chi_cat$statistic / (sum(tab_cat) * (min(dim(tab_cat)) - 1)))
cat("Cramer's V (Gender × Category):", cramer_v_cat, "\n")
## Cramer's V (Gender × Category): 0.7970959
# Convert the residual matrix to a long table
resid_cat <- as.data.frame(as.table(chi_cat$residuals))
colnames(resid_cat) <- c("Gender", "Category", "Residual")

# Top 20 by absolute residual
top20_resid_cat <- resid_cat %>%
  arrange(desc(abs(Residual))) %>%
  slice_head(n = 20)

# Visualisation: residual bar chart
ggplot(top20_resid_cat, aes(x = reorder(paste(Category, Gender, sep=" - "), abs(Residual)),
                            y = Residual, fill = Residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values=c("TRUE"="steelblue","FALSE"="tomato"),
                    labels=c("FALSE"="Under-represented","TRUE"="Over-represented")) +
  labs(title="Top 20 Residuals: Gender × Category",
       x="Category - Gender", y="Pearson Residual", fill="Interpretation") +
  theme_minimal(base_size=13)
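
The reported Cramér's V can be reproduced from the printed test statistic alone. The sketch below assumes a sample size of 895, which is inferred by summing the `n` column of the printed `gender_in_type` table rather than read from the data, and uses the three audience groups as the smaller table dimension.

```r
# Values taken from the printed chi-squared output
chi2 <- 1137.3   # X-squared
df   <- 288      # degrees of freedom = (rows - 1) * (cols - 1)

n_audience <- 3                            # Male, Female, Unisex
n_category <- df / (n_audience - 1) + 1    # number of categories implied by df
n_obs      <- 895                          # assumption: inferred from printed counts

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
v <- sqrt(chi2 / (n_obs * (min(n_audience, n_category) - 1)))
round(v, 3)
```

The result agrees with the printed value to three decimals, which also supports the inferred sample size.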

Q5. Does type or category influence longevity?

# ====== Helper: Cramér’s V ======
cramers_v <- function(tbl){
  chisq <- suppressWarnings(chisq.test(tbl))
  chi2  <- unname(chisq$statistic)
  n     <- sum(tbl)
  r     <- nrow(tbl)
  c     <- ncol(tbl)
  V     <- sqrt(chi2 / (n * (min(r-1, c-1))))
  list(
    chisq_test = chisq,
    cramer_v   = V
  )
}

# ====== A) Type × Longevity ======
tab_type_long <- table(perfume$type, perfume$longevity)
res_type_long <- cramers_v(tab_type_long)

cat("\n== Q5: Type × Longevity ==\n")
## 
## == Q5: Type × Longevity ==
print(res_type_long$chisq_test)
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 770.07, df = 99, p-value < 2.2e-16
cat(sprintf("Cramer's V: %.3f\n", res_type_long$cramer_v))
## Cramer's V: 0.309
# Residual analysis
resid_type <- as.data.frame(as.table(res_type_long$chisq_test$residuals))
colnames(resid_type) <- c("type", "longevity", "residual")

# Top 20 by absolute residual
top20_type <- resid_type %>%
  arrange(desc(abs(residual))) %>%
  slice_head(n = 20)

ggplot(top20_type, aes(x = reorder(paste(type, longevity, sep = " - "), abs(residual)),
                       y = residual, fill = residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(
    title = "Top 20 Residuals: Type × Longevity",
    x = "Type - Longevity",
    y = "Pearson Residual",
    fill = "Interpretation"
  ) +
  theme_minimal(base_size = 13)

# ====== B) Category × Longevity ======
tab_cat_long <- table(perfume$category, perfume$longevity)
res_cat_long <- cramers_v(tab_cat_long)

cat("\n== Q5: Category × Longevity ==\n")
## 
## == Q5: Category × Longevity ==
print(res_cat_long$chisq_test)
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 4818.4, df = 1584, p-value < 2.2e-16
cat(sprintf("Cramer's V: %.3f\n", res_cat_long$cramer_v))
## Cramer's V: 0.700
# Residual analysis
resid_cat <- as.data.frame(as.table(res_cat_long$chisq_test$residuals))
colnames(resid_cat) <- c("category", "longevity", "residual")

# Top 20 by absolute residual
top20_cat <- resid_cat %>%
  arrange(desc(abs(residual))) %>%
  slice_head(n = 20)

ggplot(top20_cat, aes(x = reorder(paste(category, longevity, sep = " - "), abs(residual)),
                      y = residual, fill = residual > 0)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "tomato"),
                    labels = c("FALSE" = "Under-represented", "TRUE" = "Over-represented")) +
  labs(
    title = "Top 20 Residuals: Category × Longevity",
    x = "Category - Longevity",
    y = "Pearson Residual",
    fill = "Interpretation"
  ) +
  theme_minimal(base_size = 13)

Our analysis focused on the relationship between fragrance type/category and longevity.

Using chi-square tests and residual analysis, we found a statistically significant association between longevity and both type (Cramér's V = 0.309) and category (Cramér's V = 0.700), with clear directional patterns.
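
For readers unfamiliar with the residuals used here: a Pearson residual is (observed - expected) / sqrt(expected), where the expected count comes from the row and column totals under independence. The sketch below checks this by hand on a made-up 2 × 2 table; the counts are hypothetical and unrelated to the perfume data.

```r
# Hypothetical type × longevity counts
tab <- matrix(c(30, 10, 10, 30), nrow = 2,
              dimnames = list(type = c("edp", "edt"),
                              longevity = c("Strong", "Light")))

test <- chisq.test(tab, correct = FALSE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals computed by hand
manual_resid <- (tab - expected) / sqrt(expected)

# Should match the residuals returned by chisq.test()
all.equal(as.vector(manual_resid), as.vector(test$residuals))
```

A positive residual marks a cell with more observations than independence would predict (over-represented), a negative one marks fewer (under-represented).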

Long-Lasting Fragrances (Very Strong / Strong)

Extrait de Parfum (high-concentration perfumes) is heavily over-represented in the Very Strong longevity group. This aligns perfectly with product positioning: higher concentrations naturally lead to longer-lasting scents.

Woody, Oud, and Rose categories are also strongly over-represented in the Strong group, showing that these ingredients are typically linked with longer longevity.

Conclusion: High-concentration formats combined with deep woody/rose-based notes are the typical market choice for long-lasting perfumes.

Lighter Longevity (Light / Medium)

Eau de Toilette (EDT) is strongly over-represented in the Light group and severely under-represented in the Strong group.

Similarly, fresh and light floral categories tend to be under-represented in the Medium group, which points to a shorter, lighter wear profile.

Conclusion: Lighter concentrations and fresher scent profiles naturally lean toward shorter-lasting usage.

Under-Represented Segments

Many Floral and Oriental Floral fragrances are under-represented in the Medium group. This suggests a “polarized” pattern: they are either formulated as light, fleeting perfumes or pushed directly into strong, long-lasting territory.

Conclusion: Certain categories show a two-pole distribution, rarely occupying the middle ground.

📊 Key Insights

Type and category do influence longevity, and the findings are consistent with fragrance industry intuition:

Higher concentration + heavier notes → longer-lasting scents.

Lower concentration + fresher notes → lighter, shorter-lasting scents.
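
One caveat worth checking: the degrees of freedom reported for category × longevity (1584 = 144 × 11) imply a table of 145 × 12 cells for a few hundred perfumes, so many expected counts are likely small and the chi-squared approximation may be unreliable. A possible robustness check, sketched here on a made-up sparse table, is to inspect the expected counts and request a Monte Carlo p-value through the `simulate.p.value` argument of `chisq.test()`.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical sparse 3 × 3 table with a strong diagonal pattern
tab <- matrix(c(12, 1, 0,
                 2, 9, 1,
                 0, 1, 8), nrow = 3,
              dimnames = list(category = c("A", "B", "C"),
                              longevity = c("Light", "Medium", "Strong")))

# Asymptotic test; warnings about small expected counts are suppressed here
asymptotic <- suppressWarnings(chisq.test(tab))
min(asymptotic$expected)   # values below 5 flag a weak approximation

# Monte Carlo p-value based on 2000 simulated tables
simulated <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
simulated$p.value
```

If the simulated and asymptotic p-values lead to the same conclusion, the association is unlikely to be an artifact of sparse cells.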

Business implications:

For markets demanding long-lasting performance, brands should prioritize Extrait de Parfum formats and emphasize Woody / Oud / Rose compositions.

For everyday, casual consumers, the focus should be on EDT / fresh scents.

This analysis bridges consumer expectations with product design decisions, helping brands position products more strategically.

Conclusion

This analysis of the perfume dataset provides a structured view of how the fragrance market is shaped by audience preferences, brand strategies, product categories, and technical attributes such as type and longevity. From Q1 through Q5, several key insights emerge:

Unisex fragrances are no longer niche (Q1). With over one-third of the market, unisex perfumes have surpassed both male- and female-targeted products, reflecting a broad cultural shift toward inclusivity and flexibility in personal expression.

A few brands dominate through large product portfolios (Q2). Jean Paul Gaultier, Paris Corner, and Armaf together account for about half of the top 10 market share. Traditional luxury brands remain influential but compete more on brand equity than on sheer variety.

Woody, spicy, and floral–oriental blends define the mainstream market (Q3). Categories such as Woody Spicy and Floriental capture the largest shares, while Eau de Parfum (EDP) is the overwhelmingly dominant type. Niche categories like fresh or aquatic scents remain underrepresented, yet may offer opportunities for differentiation.

Gender preferences are statistically significant and structured (Q4). Chi-square tests confirm strong associations: floral and fruity categories are over-represented among female products, woody and spicy categories dominate male lines, while some blends (e.g., Amber Woody) successfully bridge into unisex markets. This highlights both the persistence of traditional preferences and areas of convergence.

Longevity is shaped by structural choices (Q5). Certain categories and types are systematically associated with stronger or longer-lasting scents, suggesting that product design choices directly influence consumer perception of durability and value.

📊 Strategic Takeaways

Invest in unisex product lines: Demand for gender-neutral fragrances has become mainstream.

Differentiate within dominant categories: The woody and floral–oriental spaces are crowded; innovation is required to stand out.

Balance portfolio strategy: Brands can win either through scale (broad product ranges) or through premium positioning with smaller but iconic collections.

Leverage longevity as a value driver: Positioning long-lasting perfumes within competitive categories may strengthen consumer trust and pricing power.